Natural language, spoken and signed, is a multichannel phenomenon,
involving facial and body expression, and voice and visual intonation that is 
often used in the service of a social urge to communicate meaning. Given 
that iconicity seems easier and less abstract than making arbitrary connections 
between sound and meaning, iconicity and gesture have often been invoked in 
the origin of language alongside the urge to convey meaning. To get a fresh 
perspective, we critically distinguish the origin of a system capable of evolution 
from the subsequent evolution that system becomes capable of. Human 
language arose on a substrate of a system already capable of Darwinian evol- 
ution; the genetically supported uniquely human ability to learn a language 
reflects a key contact point between Darwinian evolution and language. 
Though implemented in brains generated by DNA symbols coding for protein 
meaning, the second higher-level symbol-using system of language now oper- 
ates in a world mostly decoupled from Darwinian evolutionary constraints. 
Examination of Darwinian evolution of vocal learning in other animals 
suggests that the initial fixation of a key prerequisite to language into the 
human genome may actually have required initially side-stepping not only ico- 
nicity, but the urge to mean itself. If sign languages came later, they would not 
have faced this constraint. 
1. Introduction

The origin of human language is intrinsically interesting to humans. This differ- 
ence between humans and other sentient animals must have been obvious to 
Palaeolithic humans; and the recent resurgence of interest in language origins 
(e.g. [1-10]) has been little hindered by the paucity of hard evidence (a Pleisto- 
cene video would be nice) — or sensible admonitions to attend to more tractable 
problems. The following attempts to bring a fresh perspective by using an ana- 
logy with the origin and evolution of cellular coding systems. In other places 
[11-15], I have argued that DNA-and-protein-based life and language-based 
human thought may have enough in common — as the only two naturally 
occurring examples of a system using long code strings to construct thousands 
of parallel self-assembling meaning strings — to make it fruitful to use one 
system to make analogical predictions about the other. Two insights reached 
in the course of developing that analogy are useful here: (i) the difference 
between the origin and the evolution of a symbol-using system and 
(ii) the critical role played at the origin of the system by intermediate strings 
of 'symbol-representation' segments with properties partway between symbol 
and meaning. This unconventional starting point leads us to a new physical 
perspective on iconicity and arbitrariness. 

Predictive analogy [16] — where a map of object properties and relations is 
drawn between a source and a target field to make predictions about less well- 
understood objects and relations in the target field — has served in the history of 
science as a strategy for nudging the mind off the unconscious
rails on which it often runs within a field. For example, Darwin's 
Origin of Species is essentially a book-length analogy between 
animal breeding (artificial selection) and a postulated analogue 
of it in nature untouched by humans (natural selection). 
Merely wandering off track, however, does not help unless sup- 
port for a resulting new idea can be adduced from within the 
target field. But analogical mapping can also provide insight 
into the source domain. Much of the present issue concerns an 
analogy between spoken language and signed language, 
which are implemented with different sensory and motor mod- 
alities and differ in their detailed structure and brain 
implementation. Our understanding of human language capa- 
bilities has been deepened and generalized by using that 
mapping in both directions. Certainly, the analogy with cellular-level
coding systems spans a much greater distance than
that one. But that very distance from uniquely human abilities
promotes objectivity. Interestingly, the analogy between 
human and cellular symbol use was used by the first molecular 
biologists to reason about cells, not the other way around [13,17]. 



2. Origin versus evolution 

In discussions about how life came into existence, it is common 
to distinguish the origin of life from the Darwinian evolution of 
life [18]. The core of every living cell is a system for converting 
genes into proteins — that is, DNA symbol sequences into 
amino acid meaning chains that spontaneously fold into the 
three-dimensional molecular machinery of the cell, which 
includes enzymes, receptors, fantastical diffusion-driven mem- 
brane rotors for generating ATP [19], force-producing strands 
that use ATP and so on. Though this is confusingly called 
'protein translation' for historical reasons (based on the orig- 
inal molecular biologists' casual and faulty analogy with 
Morse code!), it is now understood by molecular biologists 
as the process by which DNA symbol strings are turned into 
phenotypic protein meanings. 
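
The symbol-to-meaning conversion described above can be sketched as a simple lookup from three-letter codons to amino acids. This is only a minimal illustration, assuming the messenger RNA form of the code string and using a small subset of the standard genetic code; the function name `translate` and the example string are hypothetical and chosen for this sketch.

```python
# A small subset of the standard genetic code (mRNA codon -> amino acid).
# The full table has 64 codons; only a few are listed for this illustration.
CODON_TABLE = {
    "AUG": "Met",  # also the usual start codon
    "UUU": "Phe",
    "GGC": "Gly",
    "GCU": "Ala",
    "AAA": "Lys",
    "UAA": None,   # stop codon
    "UAG": None,   # stop codon
    "UGA": None,   # stop codon
}


def translate(mrna):
    """Read the code string three letters at a time and build the amino acid chain.

    Reading stops at the first stop codon. Codons missing from the partial
    table raise a KeyError rather than being guessed.
    """
    chain = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3]
        if codon not in CODON_TABLE:
            raise KeyError(f"codon {codon!r} is not in this partial table")
        amino_acid = CODON_TABLE[codon]
        if amino_acid is None:  # stop codon: the chain is complete
            break
        chain.append(amino_acid)
    return chain


# Hypothetical code string: start, three further codons, then a stop codon.
print(translate("AUGUUUGGCAAAUAA"))  # ['Met', 'Phe', 'Gly', 'Lys']
```

The lookup itself is arbitrary in the Saussurean sense: nothing in the letters of a codon resembles the amino acid it stands for. The folding of the resulting chain into working machinery is not modelled here.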

The problem of the origin of such a system, however, is not 
really an evolutionary problem in the usual Darwinian sense of 
the word. If we shrink ourselves down to molecular size and 
look at what cells do, it becomes clearer that cells somehow 
invented a new kind of molecular-level intentionality — a way 
to partly overcome the deterministic thermodynamic buffetings 
to which all matter is subject. This does not imply that cells 
create mysterious, irreducible holistic forces; in fact, we have a 
deep mechanistic understanding of how they work. But it is 
a natural way of characterizing what goes on in cells that 
distinguishes it from the prebiotic chemical cycles in the
atmosphere, in volcanism, on exposed mountain ridges and in
sedimentation in rivers, on shores and down to the ocean floor, all
of which involve an energy-dissipating, order-creating, but 
non-biological kind of 'evolution'. 

The chemical soups out of which life arose were already 
complex systems containing many different types of dynami- 
cally stable units. For example, it is thought that among other 
things, prebiotic soups contained isolated amino acids, the 
eventual constituents of biotic protein chains. Cells, however, 
invented a way to encode, use and reproduce information 
about how to cause thousands of different chemical reactions 
in this soup to happen. The tricky part is that the code strings, 
as well as all of the interpreting apparatus for them, had to be 
in the soup where everything was still subject to the soup's
deterministic buffetings. At the outset, proto-information
must have been somehow partially hidden from the degrada- 
tive attack of the soup. But once cells came online, they were 
able to speed up many chemical reactions, prevent others, 
invent new ones that never used to happen at all, and 
above all, order, organize and compartmentalize the chemical 
reactions. In short, code-using cells took over forceful control 
of chemical phenomena in local regions of the otherwise still 
prebiotic soup at the energetic expense of their surroundings. 

But this 'evolution' from prebiotic to biotic systems was not 
modern Darwinian evolution. Until most of the code-using 
system was in place, bona fide Darwinian evolution as it is usually 
defined — heritable variations in fitness [20] — was not possible. 
Darwinian 'heritable variations in fitness' presupposes a
genotype/phenotype distinction and full cellular intentionality, not
mere replication in a biochemist's well-supplied reaction
chamber. The central problem of the origin of the coding 
system in cellular life is to try to come up with pre-Darwinian 
reasons for how such an intentional system might have arisen 
out of prebiotic situations lacking intentionality. 



3. A second origin 

In thinking about the origin of the higher-level symbol-using 
system in human language and culture, the situation is quite 
a bit more complex as human language was built upon a pre- 
existing cellular genetic system that was already capable of 
Darwinian evolution. Despite huge differences in scale, it is 
a remarkable fact that the fundamental rate of meaning 
assembly in these two very differently scaled systems 
is almost the same — a handful of amino acids per second, a 
handful of word meanings per second. Despite this simi- 
larity, language-based cultural evolution is much faster 
than biological evolution; lately, it has gotten so fast that bio- 
logical evolution is effectively stationary by comparison 
(there is precious little 'nature untouched by humans' left). 
The faster speed of cultural, language-based evolution is largely
attributable to the fact that cellular symbol strings
(DNA and RNA) are comprehended (turned into protein 
meanings) but never produced from meanings; cells instead 
have to wait a long time for favourable mutations to occur. 
Human language, by contrast, is a more dynamic, free-for- 
all, two-way system in which people willy-nilly inject 
mental-reaction-controlling speech symbol streams into each 
other's brains [13]. The great difference in effective evolution- 
ary velocity has made it difficult for transient language- and 
culturally-transmitted memes to be fixed in the much more 
slowly evolving genomes of long-lived humans. Despite 
their great reach and elegant variation, humans are much 
more genetically similar to each other than even small local 
populations of most other animal species. 

But there is one great point of interaction between the 
DNA-based genetic system and the language-based human 
cultural system — the genetic basis of the peculiar human abil- 
ity to readily learn a language. It seems seductively natural to 
try to come up with Darwinian evolutionary explanations for 
why this might have occurred, in the spirit of evolutionary 
psychology. By contrast, I think we may be able to make 
more progress by considering the origin of language as essentially
a pre-evolutionary problem, that is, as the second
origin of a symbol-using and evolution-supporting system,
one that partly relies on DNA-based symbols for its
construction and persistence, but that is largely decoupled
from biological evolution in its form and content. 

In fact, human language might best be thought of as a 
brain operating system that allowed us to partly overcome 
the constraints on the Darwinian biological evolution of be- 
haviour in much the same way that cells have partly 
overcome the deterministic constraints on the 'evolution' — 
now in the physicist's sense of 'the evolution of a dynamical 
system' — of prebiotic soups. From this analogical perspective, 
the origin of language may have involved the reinvention of 
the trick of hiding information 'in plain view' from the dissi- 
pative attack of the 'soup'. But this time, the 'soup' consisted 
of Darwinian constraints on the evolution of animal behav- 
iour. Instead of using Darwinian evolution to explain 
language, our goal is to see how the origin of language 
circumvented evolutionary psychology. 

4. The 'semantic urge' 

An assumption that lies behind many language origin scen- 
arios is something that I have called the 'semantic urge'. 
This intuition grows out of the fact, just mentioned above, 
that the human linguistic coding system was built on top of 
a lower-level biological coding system that was already 
capable of constructing sophisticated, non-linguistic cognitive 
systems such as those in parrots, whales, dogs and primates. 
The sustained goal-directedness of animals makes it very 
hard to avoid the notion that human language must have 
somehow grown out of an insistent craving of inarticulate 
hominins to communicate complex meanings to each
other, perhaps initially by gesture, given that vocal motor 
control of the kind possessed by the primates most closely 
related to us probably was not up to the task. 

This is at heart a Baldwinian picture, where behaviour 
provides a selective context that drives standard Darwinian 
evolution [21]. As noted above, there is little tendency to 
fall back on intuitions like this in thinking about the origin 
of cellular life because cellular life had no analogous pre- 
existing, code-using system capable of goal-directed behav- 
iour beneath it. For example, one influential picture about 
the origin of life is that proto-symbol chains emerged first 
without standing for anything and then only later were 
taken over as a code for other meaning chains that could 
fold up and control chemical reactions. There is no Baldwi- 
nian semantic urge of prebiotic soups to control their 
surroundings that drives the emergence of cellular proto- 
symbol chains. The picture of proto-language that we arrive 
at by taking this analogy seriously is somewhat peculiar; 
but it fits much better with what we know about the evol- 
ution of vocal behaviour in other animals (see also 
[9,13,14,22,23]). Before turning to that evidence, let us first review some of
the evidence for the origin of symbols and proto-symbols at 
the cellular level. 

5. The 'RNA world' without a semantic urge 

The original idea of an 'RNA world' [24] was independently 
proposed by Woese [25], Crick [26] and Orgel [27] as a prede- 
cessor of modern DNA/RNA/protein-based life. It gained 
major support from the unexpected demonstration in the 
early 1980s that RNA could act as a bona fide, enzyme-like 
catalyst (an RNA segment in single-celled Tetrahymena was
discovered to fold up and catalyse RNA-splicing). The foun-
dational role of RNA is obvious from observing its current 
position in cells (recent review: [28]). 

Unlike other components of the cell, RNA can act either as 
a one-dimensional symbol string (messenger RNA) or as a 
three-dimensional self-folding controller of chemical reac- 
tions (structural RNAs, based on the protein-like ability of 
RNA to form precisely shaped surface cavities with high 
specificity for particular substrates). This crucial Janus-like 
ability is used sparingly; modern cells mostly employ pro- 
teins, not RNAs, to control and catalyse the thousands of 
chemical reactions they run. The instances where RNA is 
used as a protein-like structure, however, stand at the very 
centre of code-use in cells, including (i) RNA editing (in the 
nucleolus, which generates ribosomes, and in the spliceo- 
some, which edits messenger RNA), (ii) recognizing words 
(codons) in code-like messenger RNA strings (by transfer 
RNA) and, most importantly, (iii) using the sequence of 
recognized words in messenger RNA code strings to assem- 
ble individual amino acid meaning units into functional 
protein chains (by ribosomal RNA). 

The idea that the dual roles of RNA as code and catalyst 
might have bootstrapped life gained support over the years 
as additional catalytic RNAs were discovered, and especially 
after large-scale structural RNAs (e.g. ribosomes) were finally 
crystallized showing that it was the RNA itself, not the 
associated proteins, that directly catalysed the attachment of 
each coded-for amino acid 'meaning unit' onto the growing 
protein chain [29-31]. 

6. The origin of proto-information: before the RNA world

Despite the intuitive plausibility of the 'RNA world', however,
it has proved difficult to find plausible prebiotic
synthesis pathways for nucleotides, the subunits of RNA 
[32]. This stands in sharp contrast to the easy prebiotic avail- 
ability of amino acids [33,34], the eventual molecular units of 
meaning. Many origin-of-life researchers were led to search 
for chain-like prebiotic precursors of RNA itself based on 
more easily obtainable subunits than nucleotides [35]. A 
key feature of this search — so obvious to those within the 
field of prebiotic synthesis that it is rarely explicitly stated — 
is to find reasons other than the ability to code for amino 
acid meanings as to why a pre-RNA-like molecule might 
have come into existence. 

Once pre-RNA or RNA existed, its dual role as a catalyst
(protein-like) and a code-chain (DNA-like) could then be
discovered, leading to something like an RNA world [36-38], 
and then finally to modern DNA/RNA/protein life, where
code-only DNA and meaning-only proteins occupy opposite 
ends of the spectrum from symbol to meaning. 

7. Problems with the semantic urge in the prelinguistic world

In turning back to language, many language origins scenarios 
start with a repertoire of already meaningful vocalizations 
like those used by many different animal species [39] and 
then attempt to come up with a reason (typically, the semantic
urge) for why they might have multiplied [40,41]. Several
people [42,43] have pointed out that there is a major problem
with this standard scenario. Animal calls — such as the well- 
studied set of vervet monkey alarm calls [44], but also calls 
in many other animal species — are laden with emotional 
meaning. The usual explanation for this is that alarm calls 
often are emitted in life and death situations, which generates 
strong selection pressure to maintain call reliability; this 
reliability has been ensured across many diverse species by 
tightly neurally tying calls to the emotional state of both 
the sender and the receiver. This tight linkage, however, pre- 
sents a problem for the scenario of calls as a starting point for 
proto-language; the number of different emotional states is 
rather small, and emotional states do not follow each other 
in a quick, regular succession like words must do (see 
below, on the need for speed). 

The origin of language required the development of a 
large inventory of sound combinations in order to code for 
thousands of meanings that are made more specific by assem- 
bling them into strings at the rapid rate of several words per 
second. Individual word meanings (especially for high-frequency
polysemous words central to every language,
such as 'over', 'put', 'give', 'line', 'big' or 'hand') are freed
from emotion when compared to animal calls. Certainly,
some single words such as epithets can be intrinsically emot- 
ive, but these are in a small minority. Perhaps, the difficulty 
of imagining a path from a handful of emotive calls to the 
5000-word core of mostly emotionally neutral words in 
human language stems from the fact that the two are phylo- 
genetically unrelated. The analogy with the RNA world and 
the pre-RNA world suggests that perhaps we should instead 
try to find a way by which a large pool of pre-RNA-like pre- 
words might have been generated — units that are like words 
in many sensory and motor respects, but that do not yet stand 
for anything. 

In the cellular situation, the bonds between any pair 
of RNA nucleotides in a chain are approximately equally 
stable. In Sereno [13], I called this arbitrariness₂ (to distinguish
it from classical pairwise symbol-object Saussurean
arbitrariness₁). Arbitrariness₂ can be thought of as a trick to
prevent the prebiotic soup from recognizing and thus selec- 
tively destroying the incipient information 'hiding' there in 
the form of different sequences. For pre-language, we may 
have needed a similar trick to prevent individual units or 
small groups of them in incipient speech symbol chains from 
(initially) being recognized as meaningful. Iconicity, but also
any accidental arbitrary₁ (Saussurean) attachment of the
vocal pre-symbol groups to emotional meaning, may have
actually been a Darwinian evolutionary impediment to the 
development of a large enough catalogue of vocal 
pre-symbol segments. As in the molecular case, proto- 
information at the linguistic level may initially have been 
hidden from the 'soup' of animal behaviour 'in plain sight' 
to allow it to accrete, and to avoid having it attacked and 
destroyed by having it elicit particular behaviours in other ani- 
mals, which would then drag everything back into the 
simplified sound features and semantics, and automatic 
emotional binding of isolated alarm calls and affiliative calls. 



8. Information versus proto-information 

In Shannon-Weaver information theory, the more random
and unpredictable a signal on a channel is (by analogy with
chemical entropy), the more information it can potentially
convey [45]. An image with large patches of white and black 
transmitted pixel by pixel is a very non-random signal. If 
this image is first compressed — e.g. by sending the number 
of white or black pixels in an all-white or all-black run 
rather than each pixel separately — the signal would be shor- 
tened while containing the same amount of information, but 
it would also begin to look more like unpredictable random 
noise. Information theory assumes that the channel — but also 
the machinery at either end that can send and understand 
the message — are both already there. 
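
The compression point above can be made concrete with a short sketch. It assumes a hypothetical single image row written as a string of 'W' (white) and 'B' (black) pixels; the function names `run_length_encode`, `run_length_decode` and `entropy_bits_per_symbol` are illustrative and not part of any cited work. Run-length encoding shortens the signal without losing information, and the empirical per-symbol entropy of the shortened signal is higher than that of the raw pixels.

```python
import math
from collections import Counter
from itertools import groupby


def run_length_encode(pixels):
    """Collapse each run of identical pixels into a (pixel, run_length) pair."""
    return [(value, len(list(run))) for value, run in groupby(pixels)]


def run_length_decode(pairs):
    """Invert run_length_encode by expanding each pair back into pixels."""
    return "".join(value * count for value, count in pairs)


def entropy_bits_per_symbol(symbols):
    """Empirical Shannon entropy, H = -sum(p * log2(p)), of a symbol sequence."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical image row with large patches of white and black.
row = "W" * 40 + "B" * 7 + "W" * 25 + "B" * 3 + "W" * 25
encoded = run_length_encode(row)
run_lengths = [count for _, count in encoded]

print(len(row), "pixels ->", len(encoded), "runs")
print("lossless:", run_length_decode(encoded) == row)
print("raw entropy (bits per pixel):", round(entropy_bits_per_symbol(row), 3))
print("encoded entropy (bits per run):", round(entropy_bits_per_symbol(run_lengths), 3))
```

Both functions presuppose an agreed convention shared by sender and receiver, which is exactly the machinery that information theory takes for granted and that an origins account has to explain.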

In thinking about origins, we have instead to think about 
how the channel, interpreters and most importantly the one- 
dimensional strings themselves came into being in the first 
place. There is an endless supply of 'uselessly random' 
things in the world that are not information or even proto- 
information because there is no way to access the strings in 
them, or because they do not even form one-dimensional 
strings at all (like disconnected gas molecules). In contrast 
to usual language origin scenarios that start with single 
meaningful Saussurean sign-object pairs, the perspective 
here is to look instead for plausible sources of proto-infor- 
mation strings — currently unused, meaning nothing, but 
potentially usable. Good quality proto-information should be 
random-appearing, like efficient Shannon information, 
which requires the ability to attach symbol segments into 
one-dimensional strings in roughly random orders. But just 
as critically, there needs to be a pre-existing productive mech- 
anism for generating and processing these one-dimensional 
strings in plain view, so they could eventually be integrated 
into a meaning delivery system, but with their content 
initially hidden because of their apparent randomness. 

Strong one-dimensional chains (of nucleotides, amino acids 
and saccharides) are what principally physically distinguish 
the current biotic molecular world from the prebiotic world 
of rocks and minerals. These chains are held together by 
strong covalent bonds in aqueous solution. Rocks and minerals 
are certainly stronger than biomolecules when dry; but once 
dissolved in water (where all the action is), the chains and lat- 
tices in rocks and minerals all break down into small pieces. 
Human language is distinguished from animal communication
in an analogous way: the towering sequence complexity of the
long one-dimensional chains in human language sets it far apart
from meaning-carrying signal systems in every other animal.
The analogy with the RNA world suggests that we look for 
sources of meaningless proto-information chains rather than 
the more traditional and intuitive approach of trying to find 
ways of making sentence-like chains out of already bound 
together but isolated symbol-referent pairs. Evidence from 
the evolution of animal vocal behaviour provides intriguing 
sources for such meaningless chains. 



9. The example of birdsong 

The possible relation between birdsong and speech was noted 
early on. Darwin, who in Origin of Species [46] often discussed
the relation between biological and linguistic evolution
(interestingly, to argue that biological evolution might be like
language evolution, not vice versa!), turned briefly, in the
Descent of Man [47], to language origins. Darwin was especially 
fond of scenarios in which a structure had initially evolved for 
one purpose only to become a 'preadaptation' for another.
Breaking with the common sense view that language arose from
a gestural substrate, he suggested that language developed out 
of a form of 'rudimentary song', a kind of purely prosodic pre- 
language that conveyed emotions and other broad, unitary 
meanings in much the same way that pitch modulation and 
emphasis are used in modern speech. Darwin mentions the 
flashy hooting vocalizations of gibbons, which are generated 
during territorial and courtship displays, as something like 
what he had in mind, but pointed out that birdsong provided 
'in several respects the nearest analogy to language' ([47], 
p. 55), citing the work of Daines Barrington a century before 
[48] on the extended learning period for birdsong, the initial 
'babbling' stage and the development of birdsong dialects. 

Modern research on birdsong has provided a neurobiological
foundation for these earlier hunches, but has also revealed
a system that looks a good deal more like human-style, 'left-hemisphere'
speech than like the call systems of other animals
(including gibbons) [10,49-51], or indeed like the call systems of
songbirds themselves, which have retained their limited set of
emotional calls alongside song. There is a powerful perennial
tendency outside fields explicitly focused on evolutionary pro- 
cesses to think of evolution in terms of a 'Great Chain of Being' 
and to ignore the mosaic nature of evolution. Thus, birdsong 
has often been dismissed as a model of human language for 
the reason that monkeys seem much smarter than some 
birds, or that monkey calls seem to have more semantic con- 
tent than birdsong. In fact, the importance of birdsong in the 
present context (only dimly glimpsed by Darwin) is exactly 
the fact that a set of language-like features have evolved 
in the absence of a semantic function. 

Birdsong requires a significant learning period, during the 
early parts of which the young bird is silent. If a bird is not 
exposed to a tutor song within a certain early critical period, 
it will produce only a crude version of its species' song. Nor- 
mally exposed young birds initially produce sounds called 
subsong that resemble the progression of types of babbling 
in baby humans — initially a broad range of sounds are pro- 
duced, followed by an unorganized recombination of 
species-specific song fragments, and then finally, adult song. 
Within a species, there are regional dialects that are learned 
from a bird's regional peers; artificial rearing experiments 
show that birds learn the dialect of their tutors, regardless of 
their genetic background. Adult song repertoires can be con- 
siderable; some wrens produce hundreds of distinct songs, 
each containing 5-20 'syllables', while mockingbirds produce 
virtually endless sequences of different syllables in variable 
orders. Good singers may have a thousand or more distinct 
'syllables' (a 'syllable' consists of a particular figure sometimes 
repeated once or twice; in this respect, it is unlike a phonetic 
syllable, which consists of one or more consonants and a 
vowel). If a songbird is deafened before learning to sing, it 
will fail to produce song-like sounds as an adult. By contrast, 
non-song birds and many other animals including non-human 
primates (e.g. gibbons) that do not learn complex serial vocal 
patterns from their peers, still come to produce their species- 
specific sound repertoire when deafened at birth [52]. In 
many respects, it might be more accurate to call it 'birdspeech', 
because birdsong differs from human singing and musical per- 
formance in many ways; for example, birdsong lacks a regular 
metre, musical tonality and harmony (though see European 
starlings on 'harmony' in §13). 

The parallel evolution of fine-grained vocal control in singing 
birds affords a crucial comparative perspective on the anatomical
and neural constraints on auditory-motor learning and perform-
ance. Birdsong is initiated in a structure called the syrinx, which is 
evolutionarily related to (and controlled by the same nerve as) 
the tongue. It is generated primarily by directly controlling the 
fundamental frequency produced by the syrinx. Human 
speech sounds, by contrast, are generated by filtering and mod- 
ulating the higher harmonics of the fundamental frequency of the 
vocal cords in the larynx (by controlling the position of the 
tongue in the pharyngeal and oral cavities), making the higher 
frequency parts of speech sounds independent of fundamental 
frequency (voice pitch). Nonetheless, in many respects, birdsong 
is much more like human speech than are the vocalizations of 
other animals, some of which can even modulate laryngeal har- 
monics (for example, monkeys [53], male deer [54] and seals [55]) 
in a human speech-like fashion. 

There are intriguing clues about the evolution of fine vocal 
control from neuroanatomy of the avian song system. For 
example, motor output neurons in the forebrain (in the 
robust nucleus of the arcopallium, RA) of songbirds have 
gained direct access to motoneurons controlling the syrinx 
vocalization musculature. Projections from RA bypass the 
brainstem pattern generator circuitry for calls through which 
all forebrain outputs must pass in non-song birds like ducks 
[56,57] but also in squirrel monkeys [58,59] and macaque mon- 
keys [60]. There is a striking parallel here to the evolution of 
fine finger control in primates (but also finger control in raccoons,
as yet another reminder that evolution is a bush,
not a linear Great Chain of Being), where motor cortex neurons 
have also come to contact finger motoneurons directly, bypass- 
ing pattern generators for coordinated limb movement 
situated in the spinal cord; hand motor cortex in cats, by con- 
trast, contacts primarily the spinal pattern generators, which 
then have the only private access to motoneurons. The more 
direct access afforded to the forebrain in the case of the song- 
bird syrinx and the primate and raccoon hand presumably 
underlies more complex, differentiated, learned control of 
these effectors. Note that this means that the relevant forebrain 
motor output areas have essentially come to assume a lower 
level in the motor control hierarchy, allowing the development 
of other forebrain pattern-generating centres that can operate 
in parallel with and independent of the brainstem and spinal 
pattern-generating circuitry that is still needed for locomotion 
in the case of the hand, and non-song vocalization in the case 
of the syrinx. 

10. Speech-like birdsong carries less meaning than vocal call systems do

The most striking characteristic of birdsong, however, in 
light of its prodigious complexity, is its essential lack of 
semantic content. Individual syllables or song fragments do 
not seem to have any specific meaning outside of being 
part of a particular song; and particular songs do not seem 
to convey specific content. Nor do birds appear to produce 
anything like 'words' by recombining their 'syllables' in 
order to signify concepts. Despite having motor, vocaliza- 
tion and auditory equipment ideally suited to support the 
re-combinable speech-symbol half of a language-like 
meaning-conveying system, birdsong seems to communicate 
only very general meanings. Songs serve to mark territories, 
identify the singer's species, attract mates and cause ovula- 
tion, often all at once. The messages communicated by
birdsong are, in fact, less content-filled than the messages com-
municated by, for example, vervet monkey calls — which have 
been shown to signify rather elaborate distinctions among 
predators and conspecifics [44], despite the fact that these 
unlearned, unitary calls are drastically simpler than birdsong. 
This difference in referential content is particularly obvious 
when we consider the 'meaning' of a handful of syllables of a 
songbird's song; though emotion is keenly involved in motivat- 
ing the bird to begin singing, the identity and order of syllables 
carry no additional specific emotional baggage. 

Attempts to find particular ordering patterns in bird song 
are a topic of hot debate, particularly with respect to what 
level of complexity of song grammars birds are able to recog- 
nize after training [61]. Without visiting that particular 
debate, in the case of natural song sequences, researchers 
have shown that song order is not random, but can be mod- 
elled with hidden Markov models [62]. However, in support 
of the present line of argument on the meaninglessness of 
sequences in birdsong, there is little evidence that the differ- 
ent naturally occurring syllable and song orders themselves 
signify different things, at least from observational studies 
of the behaviours of singer or listener birds. 
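
The distinction between non-random order and meaningful order can be made concrete with a minimal sketch. It assumes a first-order Markov chain over observed syllables, which is simpler than the hidden Markov models of [62], and it uses hypothetical syllable labels in a toy 'song' rather than recorded data. The function name and the toy sequence are illustrative assumptions only.

```python
import math
import random
from collections import Counter, defaultdict


def next_syllable_entropy(sequence):
    """Conditional entropy H(next | current) in bits, estimated from bigram counts."""
    followers = defaultdict(Counter)
    for current, nxt in zip(sequence, sequence[1:]):
        followers[current][nxt] += 1

    total_pairs = len(sequence) - 1
    entropy = 0.0
    for counts in followers.values():
        row_total = sum(counts.values())
        for count in counts.values():
            p_pair = count / total_pairs   # P(current, next)
            p_next = count / row_total     # P(next | current)
            entropy -= p_pair * math.log2(p_next)
    return entropy


# Hypothetical toy "song": a four-syllable motif repeated 25 times.
song = list("abcd" * 25)

# Same syllables with the order destroyed (fixed seed for repeatability).
shuffled = song[:]
random.Random(0).shuffle(shuffled)

h_song = next_syllable_entropy(song)
h_shuffled = next_syllable_entropy(shuffled)
print(f"ordered song:  {h_song:.2f} bits per transition")
print(f"shuffled song: {h_shuffled:.2f} bits per transition")
```

In the ordered toy song each syllable fully predicts the next, so the estimated entropy is zero, while the shuffled version with the identical syllable inventory is far less predictable. A measure of this kind captures sequential structure only; it assigns no content to any syllable or ordering, which is the sense in which song order can be lawful without being meaningful.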



11. Sexual selection and birdsong 

One plausible theory about birdsong is that it was a product of 
runaway sexual selection — like the male peacock tail or outsize 
antlers in male deer, or huge inconvenient-looking sexual 
swellings in female baboons and female chimpanzees. Elabor- 
ate singing abilities seem to have been preferred by mates, 
despite making little direct contribution to fitness beyond the 
fact that they were preferred. Sexual selection stands in con- 
trast to natural selection, which rewards improved function 
like a stronger beak or more efficient wings. Certainly, a com- 
plex song can serve as a sign of a mate fitter in other non-song 
respects. It is a little more difficult to explain the maintenance 
of extreme examples this way, especially when sexually 
selected features run the risk of impeding other functions 
(huge antlers) or attracting predators (elaborate vocal dis- 
plays). Zahavi [63] has suggested that these handicaps have 
evolved to serve as an honest signal; the feature advertises 
that the animal was fit enough to overcome the handicap. 
Though empirical support for the handicap model from 
animal studies has been mixed, there is little doubt that 
sexual selection in general can drive evolution in a different 
direction from natural (functional) selection. 

Sexual selection is not confined to female choice affecting 
male characters (see primate sexual swellings above). And 
both the male and female sing in some songbird species. 
Bay wren male-female monogamous pairs, for example, exe- 
cute precisely coordinated 'duets' where the pair trade 
singing back and forth several times a second, creating 
what sounds to an untrained ear like the song of a single 
bird (see example in [23]). In these birds, the song control 
nuclei are large and hormone-sensitive in females as well as 
males [64]. The generally accepted explanation for this behav- 
iour is that the attractiveness of the male's song to listening 
females is reduced when a duetting female is heard 
intimately trading back and forth with that male. 

Several whale species have independently evolved a vocal 
learning system that resembles birdsong in a large number of 
respects and provides a key additional example of how a
speech-like vocal learning system can evolve without a 
'semantic urge' [65]. Humpback whales learn to precisely 
reproduce long sequences of sounds and culturally transmit 
them to animals that are genetically unrelated. The main 
difference is that whale songs are lower in pitch, and individ- 
ual songs unfold over a minute instead of several seconds. 
The underwater acoustic environment of the ocean is quite 
reverberant, due to the faster and more efficient propagation 
of sound in water and as a result of reflections from the 
air-water boundary. This may be one reason for whales' 
more leisurely tempi. As with birdsong, whale song has 
social and sexual functions, and precise, lengthy sequence 
perception and generation. 

12. The need for speed 

One idea implicitly introduced above was the notion that 
meaning assembly in language might demand a certain mini- 
mum speed, like flying. At first, it might seem that there is no 
'minimum speed' for language, as strongly motivated humans
(e.g. Stephen Hawking) are capable of comprehending and 
producing language at very slow rates (e.g. one word per 
minute). However, in the context of the origin of language, 
for a proto-meaning-assembly process to be useful in a social 
context among proto-linguistic animals, it is less clear that 
such extremely leisurely rates would be practical; the chance 
of non-linguistic interactions and events disturbing the mean- 
ing assembly process increases as the interword time goes up. 
A second motivation for speed is that if language meaning 
assembly piggy-backed on non-linguistic visual scene assem- 
bly (see below), the word-meanings-per-second rate might 
initially have had to more closely match the typical rate of 
uptake of sequential glances used by the visual system (several 
new fixations per second). 
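
The size of the gap between these two rates is easy to work out. The following sketch assumes that 'several' fixations per second means three, and takes the one-word-per-minute figure at face value; both constants are illustrative assumptions, not measurements.

```python
# Illustrative assumptions, not measurements reported in the text:
WORDS_PER_MINUTE_SLOW = 1.0    # the very slow language rate mentioned above
FIXATIONS_PER_SECOND = 3.0     # "several" fixations per second, taken here as 3

seconds_per_word = 60.0 / WORDS_PER_MINUTE_SLOW
seconds_per_glance = 1.0 / FIXATIONS_PER_SECOND
slowdown = seconds_per_word / seconds_per_glance

print(f"interword interval:   {seconds_per_word:.1f} s")
print(f"interglance interval: {seconds_per_glance:.2f} s")
print(f"slow speech is about {slowdown:.0f} times slower than glance uptake")
```

Under these assumptions the slow rate leaves roughly two orders of magnitude more time between successive units than the visual system normally allows between glances, which is ample time for unrelated events to intrude on the assembly process.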

These two considerations may provide additional 
independent motivations for why elaborate, rapid, but non- 
meaningful perceptual and motor string sequencing might have 
had to evolve first in order to boot language. These arguments 
are relevant not only for vocal signalling but also for visuoman- 
ual signalling. It is important not to forget that a small set of 
meaningful alarm calls and meaningful gestures have indepen- 
dently evolved in a very large number of different animal 
species for half a billion years. The evolutionary advantage
to a species of being able to communicate more complex mean- 
ings would have been (and remains) great; that is, there must 
have been countless opportunities to extend unitary pre-existing 
meaningful calls or gestures, but none was successfully taken.

13. A birdsong-like 'RNA world' for pre-language 

With the context provided above, we can see our way to a 
surprising extension of Darwin's language-origins theory. 
On the evidence of the avian case, it seems possible that 
early hominids might have initially evolved an elaborate 
system of essentially phonetic vocalizations — a kind 
of 'song talk' with no attached semantics — as a result of 
sexual selection. In this view, a number of the specializations 
for speech-related auditory-vocal control evolved for entirely
non-semantic reasons. Perhaps early hominid pairs 'duetted' 
like bay wrens for several million years before reference was 
invented. At first, it might seem unreasonable to imagine 
almost fully developed human speech without meaning;
but the surprising variety of birdsong — and the parallel 
whale example — suggest that the evolution of the ability to 
generate elaborate but meaningless vocal sequences is a 
reasonably common, evolutionarily stable strategy. Turning 
standard Baldwinian language origins scenarios on their 
heads, the pre-adapted 'symbols'-without-meaning system 
might have only been taken over for use as a semantic vehicle 
at the very last moment. The scenario of talking before refer- 
ential speech is odd, but no odder than birdsong and whale 
song themselves. 

This scenario contrasts with Fitch's idea that laryngeal 
descent in hominids (which occurs early in postnatal devel- 
opment) might have been used as a strategy to indicate 
large size in males, by analogy with the real-time laryngeal
descent that occurs during calls made by rutting male deer 
[54]. A functionally similar kind of call, though using res- 
onant air sacs instead of laryngeal descent, is well known in 
gibbons and orangutans, and it closely resembles emotional- 
meaning-laden signals in standard animal call systems. 
Unlike birdsong, these primate calls develop even in 
deafened animals, indicating that learning is not required. 

The birdsong model suggests instead that there was 
runaway selection for complex sequences of essentially mean- 
ingless segments — each individually untied from particular 
emotions — as opposed to selection for a large-sounding roar 
or a deep voice. This is not to detract from throaty roars, 
which are a common theme in male animal vocalizations; but 
elaborate sequences can be just as attractive as deep throatiness. 
Different bird lineages have explored many intriguing ways of 
attractively increasing song complexity. The superb lyrebird 
('superb' is part of the common name) accurately imitates 
sounds in its forest environment, including the sounds of 
other birds and other animals, and poignantly, the sound of 
chain saws, attracting mates by means of a large and impress- 
ively detailed repertoire [66]. European starling males can 
independently manipulate the two halves of their syrinx 
during singing to create a 'one man band' effect with a lower 
frequency 'bass line' overlaid by asynchronous higher pitched 
notes. Experiments with playback show that the more complex 
songs sung by older, more experienced starling males are more 
effective at inducing ovulation [67]. Finally, although male plus 
female singing like the wren example discussed above is less 
common than male-only singing, it has nevertheless evolved 
in more than one lineage (e.g. female superb lyrebirds are 
quite competent imitators), overcoming objections (e.g. [68], 
p. 96) that sexual selection cannot explain female speech. 

14. Analogues of structural and catalytic RNA in 
the auditory system 

As mentioned above, RNA molecules serve both as a code 
(messenger RNA), but also critically as non-code-like, self- 
folding word-recognition devices (transfer RNA) and cataly- 
tic meaning chain-assembly devices (ribosomal RNA). By 
analogy, the internal representations of speech sound 
sequences that a primate neurobiologist would expect to 
find in the human superior lateral temporal cortex may 
have acquired other functions besides merely serving as 
internal copies of the speech stream. Perhaps there was a leisurely
'RNA world'-like stage as sexual selection was shaping
vocal learning, where RNA-like speech sound representations 
in the auditory temporal lobe interacted with each other and
gradually increased in complexity without being attached to 
meanings (e.g. visual meanings). 

Then at a later point, the 'catalytic' abilities of uninter- 
preted speech streams were suddenly exposed in the 
service of attaching visual meaning representations into 
chains in a manner similar to the ribosomal catalysis of 
amino acid chains. Positing non-symbolic functions for internal representations
of uninterpreted speech streams is a strange
idea that would require more evidence than we currently 
have for it to be taken seriously; however, the central role 
of structural/catalytic RNA in protein synthesis (as opposed 
to being a mere code-like messenger) is arguably just as 
strange — and it took many years before that idea was finally 
fully accepted. 

Nothing has been said (1) about how the internal represen- 
tations of speech sounds got connected to visual meanings, or 
(2) about the dynamics of how concatenated visual meaning 
patterns interact in the complex 'mental metabolism' that 
must be present in linguistically competent human brains. 
But perhaps that second bit did not have to be invented out 
of whole cloth. Instead, it could have piggy-backed on an 
already existing system for assembling visual inputs arriving 
from early visual areas during the process of visual scene com- 
prehension; the higher-level visual system was already an 
expert in the rapid serial assembly of successive glances. The 
implication is that the trick of language was not to have 
invented the basic meaningful units — nor even the rules for 
how concatenated chains of visual units self-assemble — but 
merely to have found a symbol-string-directed way of 
making standardized connections between them [3,11-13,15]. 

15. Language as code-directed scene 
comprehension 

Vision is very important to primates; in fact, almost half of the 
cortex in primates consists of areas that are specialized for 
visual processing. Primate auditory and somatosensory areas 
each cover about one-quarter as much cortical surface area 
as visual areas do [69,70]. Together with the fact that virtually 
all anthropoid primates (monkeys, apes and humans) are diur- 
nal, it makes sense for a substantial core of concrete word 
meanings concerning objects, properties, actions, manners 
and paths to be represented in the visual system. The idea 
that visual representations may be important in the semantics 
of natural language [3] has been around for a long time (in lin- 
guistics; see [71-74]). Common to those approaches is the 
notion that concrete visual meanings have been extended via 
analogical processes to deal with more abstract objects and 
relations. An unremarkable sentence like 'I think I got my
idea across to him' uses unmarked 'get', 'across' and 'him' 
as if the abstract concept 'idea' were a physical object being 
transported across a physical bridge from me to him. The pre- 
sent proposal goes further in suggesting a particularly direct 
moment-to-moment relationship between the mechanisms of 
scene and discourse comprehension. 

Language — especially when transcribed to text — quite 
obviously has a fundamentally serial nature. At first, vision 
might seem to be less serial. This is in large part because we 
cannot as easily and compactly 'write' vision as we can language. 
But from the point of view of primary visual cortex (or the view 
of a filmmaker), the integration of successive glances in the comprehension
of a visual scene requires a kind of serial assembly
operation similar in a number of respects to the serial integration 
of word meanings in discourse comprehension. Primates make 
long series of fixations at the rate of several new views per 
second during scene comprehension. Each fixation brings the 
high-receptor-density fovea of the retina to a new part of the 
visual scene and generates new activity in V1 dominated by
objects at fixation, which largely displaces the activity there 
caused by the previous fixation. Higher visual areas with less 
precise retinotopy somehow integrate information from these 
disconnected and distorted (centre-magnified) activity sequences 
across time (e.g. [75]) to generate an internal representation of the 
location, identity and relations of the relevant objects in the cur- 
rent scene that serves as a basis for moment-to-moment action. 
There are a number of aspects of this initially strictly serial pro- 
cess that are strongly reminiscent of serial meaning integration 
in language comprehension. 

For example, isolated glances taken out of context are as 
underspecified and polysemous as a single word taken out of 
discourse context. The context-free information available from 
an isolated 250 ms glance at a common object — e.g. a leaf on 
a nearby branch — could mean a lot of different things in the 
context of preceding and following glances — e.g. something to 
walk on (the branch), something to eat, something to duck 
under, something to grab onto to correct balance, something 
to brush aside, something to shake or something completely 
irrelevant. The specific visual meaning in a single glance is 
only sharpened and fully developed after considering the con- 
text of the full train of glances (and motor state) that came 
before and after it. This is similar to the integration of linguistic 
meaning from word strings. An isolated 250 ms experience of a 
high-frequency word like 'line' taken out of context is as poly- 
semous as the glance at a leaf on a branch; it could be about 
'line up those objects', or 'line up a supplier', or a 'line of 
kings' or a 'line of thought', 'in line with what I'm thinking', 
or 'don't cross that line', or prosaically, 'a clothes line'. 
Though the meanings conjured up here are often listed in a dic- 
tionary under 'line', most of the work may instead get done by 
interaction with context, as in the case of the leafy branch; a 
truly isolated experience of the unique semantic content of 
'line' (with anaesthesia before and after) is probably much 
more minimal than what is found in the dictionary. The 
impetus for thinking this way came from considering the 
truly unremarkable chemical properties of isolated amino 
acids — the word meanings at the molecular level — compared 
to their incredible, multifarious catalytic specificity upon 
their mere concatenation into self-folding amino acid chains. 

Second, there is a physical divide in the visual system 
between motion processing — in the middle temporal area 
(MT), the medial superior temporal area dorsal division 
(MSTd), the lateral intraparietal area (LIP), the ventral intra- 
parietal area (VIP) — and object property processing — shape, 
colour and object identity in the fourth visual area (V4) and 
inferotemporal areas. Though the properties of neurons in the 
two pathways are somewhat less distinct than once thought 
(e.g. [76]), and recent comprehensive analysis of connections 
has turned up a large number of additional (though numeri- 
cally sparse) connections [70], this basic division remains the 
first principal component of organization in the visual system. 
In language, there is a similar primary divide between noun/adjective
and verb/manner/path in phrasal syntax.

Third, information from temporally distant glances fixating 
roughly the same thing must be tied together, as with linguistic
anaphora. For example, an initial glance at a particular
monkey's face might be followed by glances at a branch, the 
ground, a different monkey, a cloud (looking for an eagle) or 
a flower, before the same monkey is finally re-fixated. At 
that moment, the particular monkey must be re-identified 
(e.g. the fixation might have fallen on the target monkey's 
shoulder rather than the face, or mostly in the opposite hemi- 
field) and then information from the previous fixation must be 
integrated with information from the current fixation; for 
example, that particular monkey could be looking more 
aggressive than he was at last glance, or now he has turned 
away, or now he is eating a small plant root, or now he is in 
the same position and mood as he was in the previous glance. 

In both signed and spoken language, there are a variety of 
long distance anaphoric dependencies and interactions estab- 
lished as a serial discourse is played out. These range from 
pronouns 'I told John about the job; he agreed to do it', to pronominals,
to more complex pointers in realistic multisentence
discourses. For example, I could metonymically refer to the con- 
tent of the entire previous paragraph with 'that crazy "visual 
anaphora" idea'. 

None of this implies that scene representations (or their 
presumed linguistic fellows) need look anything like pictures; 
the patterns in question are very likely distributed across 
many areas, a number of which show little retinotopy. 

One main difference between scene and discourse com- 
prehension is, of course, that scene comprehension is tied 
closely to the current scene. Discourse comprehension could 
be thought of as a kind of fictive visual scene comprehension 
directed, in the case of spoken language comprehension, by 
sequences of phoneme representations in secondary auditory 
cortex. The obvious advantage of linguistic discourse 
comprehension is that we are no longer tied to the current 
scene. However, once the appropriate visual word meaning 
patterns have been called up and bound together, the 
nature and interactions of the composite pattern might be 
conditioned mainly by the prelinguistic rules of interaction 
of scene representations in primate visual areas networks. 
In this sense, a part of what has been called linguistic 
syntax and semantics might not be modular with respect to 
the neuroscience of vision. 

There is in fact substantial evidence that visual areas in 
humans are involved in specifically linguistic functions. There 
is a kind of aphasia that is self-contradictorily called 'transcorti- 
cal sensory' aphasia (i.e. 'across-from-the-language-cortex' 
aphasia!) that is generated by a lesion in left human inferotem- 
poral cortex [77]. Many of these lesions are so posterior and 
ventral that they are associated with overt visual field defects. 
Transcortical sensory aphasics have poor, Wernicke's-like comprehension,
yet paradoxically (at least in the context of
traditional models of language comprehension) can repeat 
words effortlessly. Far from being 'across from the language 
cortex', the visual areas in inferotemporal cortex damaged in 
these patients may be one primary site of semantic processing 
in sighted humans. More recent cortical stimulation studies in 
intact (but epileptic) brains have demonstrated language arrest 
from inferotemporal cortex stimulation ([78]; see also [79]) and 
have also shown that middle temporal gyrus stimulation 
(between auditory superior temporal cortex and visual inferior 
temporal cortex) can create transcortical sensory aphasia-like be- 
haviour (preserved repetition with poor comprehension) but 
without impaired naming, pointing to the middle temporal 
gyrus region as a key bridge point between input phonology 
and more inferiorly located visual semantics [80].



Transcortical sensory aphasics recover more quickly than 
patients with more superior lesions, but this may only be an indi- 
cation that the functions performed by visual cortex in language 
comprehension are less lateralized than those performed by 
auditory cortex and face motor cortex. This is consistent with 
what we know about primate temporal visual areas; it has 
long been known that permanent deficits in visual pattern recog- 
nition in monkeys require bilateral inferotemporal cortex lesions 
[81]. There is no need to assume that all the cortical areas 
involved in language comprehension are equally lateralized. 

Psycholinguistic experiments using pictures inserted into 
sentences and priming between words and pictures [82,83] 
suggest that it is surprisingly easy for visually represented con- 
cepts to be integrated into ongoing linguistic discourse 
comprehension or for words to directly activate visual perceptual 
representations. This may be another indicator of the closeness of 
visual category representations to linguistic meanings. Certainly, 
it is easier to make little pictures or movies to represent a large 
range of word meanings than it is to make non-linguistic 
sounds (how to purely auditorily represent 'over' or 'give'?). 

It is certainly not necessary to site all meaning in the 
visual system. There is a rich world of non-symbolic auditory 
and somatosensory experience. And a similar serial scene 
assembly process to that described above for vision, but for 
sequences of sounds or touches, must be done in those mod- 
alities, too. And there are many other animals for whom 
vision is not the primary distal sense, such as echolocating 
bats or electric fish (many of which are nearly blind); bats 
and electric fish must rely primarily on serial assembly of 
auditory and electrosensory 'glances'; for a bat, it would 
probably be easier to comprehend an auditory 'picture' of 
'over' — a small object moving over a larger one — than it 
would be for it to comprehend a visual one. 



16. Iconicity in sign and speech 

At long last, we are positioned to consider iconicity and the 
semantic urge in sign language (see endnote 1). Individual signs often contain
handshapes, places of articulation or movements that
relatively straightforwardly refer to one or more parts of an 
object, action, property or manner. Iconicity continues to 
affect online lexical processing of signs in adults [84,85]. 
And iconicity in sign language extends into sign language 
syntax, in the form of spatial pronominal and deictic refer- 
ence, and as classifiers that modify subjects and verbs. 
Iconic signs can vary across languages (e.g. 'tree' outlined 
with two fingers versus represented as a trunk and branches 
with a raised forearm). The referent may be difficult for a sign 
language-naive hearing observer to guess; Klima & Bellugi 
[86] classified the iconicity of signs on a scale ranging from 
transparent to obscure, based on how evident the mappings 
were to naive observers. But there is little doubt that iconicity 
is more prevalent in signed languages than it is in spoken 
languages. By Klima and Bellugi's measure of transparency, 
the most strongly iconic speech examples (outside of a 
class of onomatopoetic speech sounds and ideophones)
would probably rate as 'obscure' — for example, the high 
vowel for 'me' or 'here' may both indicate closeness to the 
speaker, while a lower vowel for 'you' or 'there' may indicate 
greater distance. 

The primary reason for this visual/auditory difference in 
the prevalence of iconicity in sound and sign is simply that
the visual system dominates the primate neocortex, making 
vision the modality with the greatest iconic potential; it has 
the most diverse set of representational machinery for charac- 
terizing objects and features, and actions, paths and manners. 
But vision did not always win. We have already mentioned 
bats and electric fish. But consider catfish, which remarkably 
use taste as a distal sense; the catfish brain has detailed and 
repeated maps of taste buds [87], which are distributed 
across not only its barbels and face, but cover its body all 
the way down to the tail, and higher-level areas in the catfish 
brain must do serial assembly of gustatory glances. 

Another reason that visual iconicity is easier than auditory 
iconicity is the intrinsic dimensionality of the respective recep- 
tor surfaces and cortical maps. The visual receptor surface is 
two-dimensional (eccentricity and polar angle) while the audi- 
tory receptor surface is one-dimensional (frequency). Spatial 
coordinates and relations among actors and objects in scenes 
take a lot more work to construct using activity spread 
across auditory bandpass filters than they do from the more 
camera-like retinal movies relayed to V1.

From the viewpoint of this paper, one reason why language probably
had a vocal origin is precisely that the more natural
visual iconicity of gestures might have been an impediment 
to building up a large enough catalogue of meaningless 
pre-symbolic gesture strings in evolutionary time. 

But once the neural structures responsible for the control of 
hierarchically structured vocal motor patterns were fixed in the 
human brain by Darwinian evolution, iconicity would no 
longer serve as an impediment and could instead turn into a convenient
aid to sign language learning [88]. That it is not as
readily accessible in the auditory modality is auditory's loss. 

Finally, the idea of language in sighted humans as pri- 
marily code-directed visual scene comprehension introduced 
above provides a rationale for why auditory symbols might 
initially have been preferred as opposed to visuomanual sym- 
bols; given a vision-dominant diurnal animal, there would be 
more overlap and potential interference between symbol 
strings and meaning strings in the second case. 



17. Conclusion 

Human auditory-vocal language was unlikely to have arisen
by the multiplication and concatenation of the small set of 
hard-wired, meaningful, vocal alarm calls that have evolved 
in a large number of different animal species. Evolutionary 
constraints keep calls tightly bound to a small number of 
emotional states as honest life-and-death signals, which prevents
them from being converted into the low-cost
strings of thousands of phonemes used to communicate 
linguistic semantics. 

Rather, we turned to plausible Darwinian mechanisms — in 
particular, sexual selection — to explain the evolution of the 
neural control of complex learned vocalization attractive to 
mates for its complexity and variation, but signifying nothing 
specific. Birdsong and whale song are two parallel examples of 
how complex neural machinery for auditory-vocal sequence 
learning evolved by sexual selection without a semantic func- 
tion. Attaching meaning to pre-symbols before they were 
numerous enough, or before they were capable of being put 
together fluently into long enough chains, risks a collapse 
back to a set of isolated meaningful calls. Instead, a large 
inventory of pre-symbols needed to be initially hidden from 
meaning, but still 'in plain sight', as it were. It was probably
easier to hide proto-information in auditory-vocal sequences 
than in visual-manual gesture sequences because of greater 
natural iconicity associated with visual-manual gestures. 
This was the first pre-adaptation for language and may have 
taken several million years to fully develop. 

The second pre-adaptation for language was a system for 
assembling visual scene representations from long chains of 
successive glances for the purpose of directing ongoing situ- 
ated behaviour. This function was initially strongly tied to 
content of the current scene, specifically a visual scene, 
because vision is the primary distal sense in primates. 

The origin of language may have been the result of 
the fortuitous coming together of these two initially indepen- 
dent Darwinian pre-adaptations. The result was the 
formation of a second, higher-level symbol-using system 
that has essentially detached itself from the constraints on 
the evolutionary psychology of vocal control that are built 
into the underlying Darwinian DNA-and-protein symbol- 
using system. The new system was also super-charged 
relative to the cellular-level symbol system because of the 
added facility for symbol-string production that is not 
found in the comprehension-only cellular system. 

There remains the question of what might have accounted 
for the coming together of these two systems. After a long 
paper spent criticizing the idea of the semantic urge, perhaps 
the semantic urge is just what was needed here! The main 
argument above was not that the urge to mean did not exist, 
or was irrelevant, but rather that its typical stable evolutionary 
end state, discovered by countless lineages of animal species, 
was a system of isolated meaningful alarm and affiliative 
calls. Perhaps, the same urge to mean — when faced with the 
unique context of (i) an elaborate primate visual system and 
(ii) sexually selected receptive and productive facilities for 
vocal learning of long random-appearing strings — resulted in 
a new and revolutionary outcome: human language. This
explanation is certainly too simple-minded; but perhaps it 
gives us a fresh jumping-off point. 



Looking at language today, we see a multi-modal, multi- 
channel phenomenon that allows humans to control each 
other's minds in a much more specific and invasive way 
than any other animal can, by injecting potent linguistic 
symbol strings into each other's brains. Modern language 
has lost its fear of iconicity, both in development and adult- 
hood. But maybe none of those features was there until 
the very end. A long-standing problem in the origin of 
human language is its sudden appearance. Modern- 
appearing cultural artefacts only appear on the scene in the 
Late Pleistocene; yet it must have taken a much longer time 
for the anatomical and neural structures that control human 
vocal language production to have evolved from their primi- 
tive basal mammalian condition seen in all other anthropoid 
primates. The birdsong/'RNA-world' picture described 
above provides one way out of this problem. Perhaps non- 
semantic, birdsong-like human speech was around for a 
very long time, perhaps even dating back to early Homo 
species. This could have set the stage for the emergence of 
a linguistic 'RNA-world', where the word recognition and 
chain assembly properties of still-meaningless speech- 
sound representations could be leisurely discovered, and 
then only grafted onto a pre-existing productive visual mean- 
ing construction system at the very end. Kendon [68] recently 
commented that language is 'a poly-modalic activity today, 
[and] so it must have been in its beginnings'. I would 
agree — if 'beginnings' is taken to mean 'the very last minute'. 
Endnote

1 Curiously enough, the present analogy between cellular and linguis- 
tic code-using systems was first developed while preparing a talk on 
iconicity in American Sign Language for David McNeill's 1980 class
on language and gesture. 